Annotations and Tools for an Activity Based Spoken Language Corpus

Authors

Jens Allwood

Leif Grönqvist

Elisabeth Ahlsén

Magnus Gunnarsson

Abstract

The paper describes the Spoken Language Corpus of Swedish at the Department of Linguistics, Göteborg University (GSLC), and summarizes the various types of analysis and tools that have been developed for work on this corpus. Work on the corpus started in the late 1970s. It is growing incrementally and presently consists of 1.3 million words from about 25 different social activities. The corpus was initiated to meet a growing interest in naturalistic spoken language data. It is based on the fact that spoken language varies considerably across social activities with regard to pronunciation, vocabulary, grammar and communicative functions. The goal of the corpus is to include spoken language from as many social activities as possible, to get a more complete understanding of the role of language and communication in human social life. This type of spoken language corpus is still fairly unique even for English, since many spoken language corpora (certainly for Swedish) have been collected for special purposes, like speech recognition, phonetics, dialectal variation or interaction with a computerized dialog system in a very narrow domain, e.g. Map Task (Isard and Carletta 1995), TRAINS (Heeman and Allen 1994) and Waxholm (Blomberg et al. 1993). Compared to English corpora, the Göteborg corpus is most similar to the Wellington Corpus of Spoken New Zealand English (Holmes, Vine and Johnson 1998), but also has traits in common with the BNC, the London/Lund corpus (Svartvik 1990) and the Danish BySoc corpus (Gregersen 1991, Henrichsen 1997). The corpus is based on audio (50%) or video/audio (50%) recordings of naturalistically occurring interactions.

Similar resources

Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch

In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...

SpLaSH (Spoken Language Search Hawk): integrating time-aligned with text-aligned annotations

In this work we present SpLaSH (Spoken Language Search Hawk), a toolkit used to perform complex queries on spoken language corpora. In SpLaSH, tools for the integration of time aligned annotations (TMA), by means of annotation graphs, with text aligned ones (TXA), by means of generic XML files, are provided. SpLaSH imposes a very limited number of constraints to the data model design, allowing ...

Vague Language and Interpersonal Communication: An Analysis of Adolescent Intercultural Conversation

This paper is concerned with the analysis of the spoken language of teenagers, taken from a newly developed specialised corpus, the British and Taiwanese Teenage Intercultural Communication Corpus (BATTICC). More specifically, the study employs a discourse analytical approach to examine vague language in an intercultural context among a group of British and Taiwanese adolescents, paying particul...

Annotating progressive aspect constructions in the spoken section of the British National Corpus

We present a set of stand-off annotations for the ninety thousand sentences in the spoken section of the British National Corpus (BNC) which feature a progressive aspect verb group. These annotations may be matched to the original BNC text using the supplied document and sentence identifiers. The annotated features mostly relate to linguistic form: subject type, subject person and number, form ...

Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus

The LUNA corpus is a multilingual, multi-domain spoken dialogue corpus currently under development that will be used to develop a robust natural spoken language understanding toolkit for multilingual dialogue services. The LUNA corpus will be annotated at multiple levels to include annotations of syntactic, semantic, and discourse information; specialized annotation tools will be used for the a...

Publication date: 2001
